## PhaseFool: Phase-oriented Audio Adversarial Examples via Energy Dissipation

### Abstract
Audio adversarial attacks craft perturbations on inputs that lead an automatic speech recognition (ASR) model to predict incorrect outputs. Current audio adversarial attacks optimize perturbations under different constraints (e.g. the $l_p$-norm for the waveform or the principle of auditory masking for the magnitude spectrogram) to achieve imperceptibility. Since phase is not relevant for speech recognition, existing audio adversarial attacks neglect the influence of the phase spectrogram. In this work, we propose a novel phase-oriented algorithm named PhaseFool that can efficiently construct imperceptible audio adversarial examples with energy dissipation. Specifically, we leverage the phase spectrogram and the spectrogram consistency of the short-time Fourier transform (STFT) to adversarially dissipate the energy that is crucial for ASR systems. Since the magnitude spectrogram plays a dominant role in human perception, the phase-oriented perturbations cause tiny auditory differences. Experimental results demonstrate that PhaseFool can inherently generate full-sentence imperceptible audio adversarial examples with a 100% targeted success rate within 500 steps on average (a 9.24x speed-up over current state-of-the-art imperceptible counterparts), and their imperceptibility is verified through a human study. Most importantly, PhaseFool is the first to exploit phase-oriented energy dissipation in audio adversarial examples rather than adding perturbations to the audio waveform like most previous works.

### Highlights
- We leverage the spectrogram consistency of STFT to construct phase-oriented audio adversarial examples, which takes further steps towards imperceptibility and efficiency.
- We further investigate the relationship between phase perturbations and the magnitude spectrogram, a phenomenon we call energy dissipation. Instead of the common $l_{p}$ distance metrics or magnitude-oriented metrics based on the psychoacoustic principle, we propose a phase-oriented metric based on the phenomenon of energy dissipation.
- Further investigating the properties of PhaseFool, we gain additional insights into the vulnerability of ASR systems. This yields a better interpretation of audio adversarial attacks: the perturbation positions reveal which parts of a magnitude spectrogram are important but also fragile for the corresponding ASR system.
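
The link between phase perturbations and the magnitude spectrogram can be illustrated with a small standalone sketch. It is not the PhaseFool algorithm and does not use the repository's code: it applies random (not adversarial) phase offsets, and the helper names (`stft`, `istft`, `perturb_phase`, `resynthesized_energy`), the toy sine input, and the frame settings are all assumptions chosen for illustration. The idea is that a spectrogram with unchanged magnitudes but shifted phases is generally no longer consistent, so converting it back to a waveform and re-analyzing it yields less spectrogram energy than before.

```python
import cmath
import math
import random

N_FFT, HOP = 32, 8  # toy frame length and hop size (assumed values)
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / N_FFT) for n in range(N_FFT)]  # periodic Hann


def dft(frame, sign):
    """Naive DFT (sign=-1) or unnormalized inverse DFT (sign=+1)."""
    n = len(frame)
    return [sum(v * cmath.exp(sign * 2j * math.pi * k * t / n) for t, v in enumerate(frame))
            for k in range(n)]


def stft(x):
    """Windowed DFT of each frame; returns a list of complex frames."""
    starts = range(0, len(x) - N_FFT + 1, HOP)
    return [dft([x[s + n] * WINDOW[n] for n in range(N_FFT)], -1) for s in starts]


def istft(spec, length):
    """Least-squares inverse STFT: windowed overlap-add divided by the summed squared window."""
    num, den = [0.0] * length, [0.0] * length
    for m, frame in enumerate(spec):
        samples = [v.real / N_FFT for v in dft(frame, +1)]
        for n in range(N_FFT):
            num[m * HOP + n] += WINDOW[n] * samples[n]
            den[m * HOP + n] += WINDOW[n] ** 2
    return [a / b if b > 1e-12 else 0.0 for a, b in zip(num, den)]


def energy(spec):
    """Total squared magnitude of a spectrogram."""
    return sum(abs(v) ** 2 for frame in spec for v in frame)


def perturb_phase(spec, scale, rng):
    """Keep every magnitude; shift every phase by a random offset in [-scale, scale]."""
    return [[abs(v) * cmath.exp(1j * (cmath.phase(v) + rng.uniform(-scale, scale)))
             for v in frame] for frame in spec]


def resynthesized_energy(x, scale, seed=0):
    """Return (energy before, energy after perturbing phase and re-analyzing the waveform)."""
    spec = stft(x)
    perturbed = perturb_phase(spec, scale, random.Random(seed))
    return energy(spec), energy(stft(istft(perturbed, len(x))))


signal = [math.sin(2 * math.pi * 5 * n / 64) for n in range(256)]
for scale in (0.0, 0.5, 1.5):
    before, after = resynthesized_energy(signal, scale)
    print(f"phase noise +/-{scale}: fraction of energy kept = {after / before:.3f}")
```

With no phase offset the energy is preserved up to rounding, while larger offsets lose energy after resynthesis. A naive DFT keeps the sketch dependency-free; for real audio one would more likely use `torch.stft` and `torch.istft`.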

### Requirements
- Python 3.7 or later
- Install PyTorch 1.8.0+ (https://pytorch.org/)
- Install torchaudio (https://github.com/pytorch/audio)
- Run ``❱❱❱ pip install -r requirements.txt``

### Data and ASR model
#### **Librispeech (English)**
To automatically download the data:
```console
❱❱❱ python data/librispeech.py
```
#### **ASR model**
You can obtain the source code and train the model by following [1]: https://github.com/gentaiscool/end2end-asr-pytorch

Put your trained model in the `./save` folder and update the configs in `./run/phase.py` and `./utils/constant.py`.



### Generating the examples

#### Parameters
- `adv_continue_from`: the path of the trained ASR model
- `num_attack`: the number of audio adversarial examples you want to generate
- `max_iterations`: the maximum number of PhaseFool iterations
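
As a rough sketch of how these three settings relate, the snippet below groups them in a small validated container. The `AttackConfig` class is hypothetical and not part of this repository (the real values are set in `./run/phase.py` and `./utils/constant.py`), and the values shown, including the path, are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackConfig:
    """Hypothetical container for the three settings; not part of the repository."""

    adv_continue_from: str  # path of the trained ASR model
    num_attack: int         # number of adversarial examples to generate
    max_iterations: int     # maximum number of PhaseFool iterations

    def __post_init__(self):
        # Reject obviously invalid settings before a long attack run starts.
        if not self.adv_continue_from:
            raise ValueError("adv_continue_from must point to a trained ASR model")
        if self.num_attack < 1:
            raise ValueError("num_attack must be at least 1")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")


# Placeholder values for illustration only; the path is not a real file name.
config = AttackConfig(
    adv_continue_from="./save/your_trained_model",
    num_attack=100,
    max_iterations=1000,
)
print(config)
```

Validating up front is a design choice of this sketch; the repository itself may handle these values differently.
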
#### Generate
We have packaged the 100 clean audio clips and target transcriptions used in our experiments in the `./adv_dataset` folder. You can test the code with the following command:
```console
❱❱❱ CUDA_VISIBLE_DEVICES=0 python ./run/phase.py
```
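
Once examples have been generated, the targeted success rate reported in the abstract can be understood as the fraction of examples whose ASR transcription matches the target transcription. The sketch below is a minimal illustration under that assumption; the `normalize` and `targeted_success_rate` helpers, the case and whitespace normalization, and the sample strings are hypothetical and are not taken from this repository.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def targeted_success_rate(predictions, targets):
    """Fraction of predictions that exactly match their target transcription."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one example is required")
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Made-up transcriptions for illustration; the third one misses its target.
predictions = ["open the door", "call me  later", "turn off the light"]
targets = ["open the door", "Call me later", "turn on the light"]
rate = targeted_success_rate(predictions, targets)
print(f"targeted success rate: {rate:.2%}")
```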

### Bug Report
Feel free to open an issue.

### Reference
[1] Winata, G. I., Cahyawijaya, S., Lin, Z., Liu, Z., & Fung, P. (2019). Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer. arXiv preprint arXiv:1910.13923. (Accepted by ICASSP 2020)
